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Abstract 

Weather forecasting is mostly based on the outputs of deterministic numerical 
weather forecasting models. Multiple runs of these models with different initial condi- 
tions result in forecast ensembles which is are used for estimating the distribution of 
future atmospheric variables. However, these ensembles are usually under-dispersive 
and uncalibrated, so post-processing is required. 

In the present work Bayesian Model Averaging (BMA) is applied for calibrating 
ensembles of temperature forecasts produced by the operational Limited Area Model 
Ensemble Prediction System of the Hungarian Meteorological Service (HMS). 

We describe two possible BMA models for temperature data of the HMS and show 
that BMA post-processing significantly improves calibration and probabilistic forecasts 
although the accuracy of point forecasts is rather unchanged. 

Key words: Bayesian Model Averaging, continuous ranked probability score, ensemble 
calibration, normal distribution. 
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1 Introduction 



The main objective of weather forecasting is to give a robust and rehable prediction of future 
atmospheric states on the basis of observational data, prior forecasts vahd for the initial time 
of the forecasts and mathematical models describing the dynamical and physical behaviour 
of the atmosphere. These models numerically solve the set of the hydro-thermodynamic non- 
linear partial differential equations of the atmosphere and its coupled systems (like surface or 
oceans for instance). The difficulty with these numerical weather prediction models is that 
since the atmosphere has a chaotic character the solutions strongly depend on the initial con- 
ditions and also on other uncertainties related to the numerical weather prediction process. 
In practice, the results of such models are never fully accurate and forecast uncertainties 
should be also taken into account in the forecast preparation. A possible solution is to run 
the model with different initial conditions (since the uncertainties in the initial conditions 
are one of the most important sources of uncertainty) and produce an ensemble of forecasts. 
The forecast ensemble can estimate the proba bility distribution of future w eather variables 
which allows probabilistic weather forecasting (iGneiting and Rafteryl . |2005| ). where not only 
the future atmospheric states are predicted, but also the related uncertainty information 
(which is indeed a valuable extension to the so called deterministic approach, where only 
the forecast is given with out uncertainty estimation). The ensemble pre diction method was 



proposed bvlLeithI (Il974r) and since its first operational implementation (IBuizza et all Il993 



Toth and Kalnayl . 119971 ) it has become a widely used technique all over the world and the 



users understand more and more its merits and economic value as well. However, although 
e.g. the ensemble mean on average yields better forecasts of a meteorological quantity than 
any of the individual ensemble members , it is often the cas e that the ensemble is under- 
dispersive and in this way, uncalibrated (IBuizza et all 120051 ). so that calibration is needed 
to account for this deficiency. 

The Bayesian Model Averagin g (BMA) method for post-processing ensembles in order to 
calibrate them was introduced by iRaftery et al\ ( 120051 ). The basic idea of BMA is that for 
each member of the ensemble forecast there is a corresponding conditional probability density 
function (PDF) that can be interpreted as the conditional PDF of the future weather quantity 
provided the considered forecast is the best one. Then the BMA predictive PDF of the future 
weather quantity is the weighted sum of the individual PDFs corresponding to the ensemble 
members and the weights are based on the relative performance of the ensemble members 
during a given training period. In practice, the performance of the individual ensemble 
members should have a clear characteristic (and not a random one) or if it is not the cas e 
th is fact should be take n into account at the calibration process (see e.g. iFraley et a/.l . 120101 ). 
In lRaftery et all ( 120051 ) the BMA method was successfully applied to obtain 48 hour forecasts 
of surface temperature and sea level pressure in the North American Pacifi c Northwest based 
on th e 5 members of the University of Washington Mesoscale Ensemble ( iGrimit and Massl . 



20021 ). These weather quantities can be modeled by normal distributions, so the predictive 
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PDF is a Gaussian mixture. Later, ISloughter et al\ (120071 ) developed a discrete-continuous 
BMA model for precipitation forecasting, where the discrete part corresponds to the event 
of no precipitation, while the cubic root of the prec i pitati on amount (if it is positive) is 
modeled by a gamma distribution. In ISloughter et al\ (120101 ) the BMA method was used for 
wind speed forecasting and the component PD Fs follow gamma distributions. Finally, using 
von Mises distribution to model angular data iBao et al\ (120101 ) introduced a BMA scheme 
to predict surface wind direction. 

In the present work the BMA method is applied for calibrating ensemble tempera- 
ture forecasts produced by the operational Limited Area Model Ensemble Prediction Sys- 
tem (LAMEPS) of the Hungarian Meteorological Service (HMS) called ALADIN-HUNEPS 
dHagei I2OI0I : iHoranvi et oAmvm . ALADIN-HUNEPS covers a large part of Continental 



Europe with a horizontal resolution of 12 km and it is obtained by dynamical downscal- 
ing (by the AL ADIN hmited a rea raodel) of the glob a l ARP EGE based PEARP system of 
Meteo France (IHoranvi et alx . l2006l : iDescamps et alx . l2009l ). The ensemble consists of 11 
members, 10 initialized from perturbed initial conditions and one control member from the 
unperturbed analysis. This construction implies that the ensemble contains groups of ex- 
changeable forecasts (the ensemble members cannot be distinguished, thus it is not possible 
to depict a systematic behaviour of ea ch member), s o for post-processing one has to use the 
modification of BMA as suggested by iFraley et al\ (120 lOl ). We remark, that BMA method 
has already b een successfu ll y app lied for wind speed ensemble forecasts of the ALADIN- 
HUNEPS - in iBaran et al\ (120131 ) it is shown that BMA post-processing of these forecasts 
significantly improves the calibration and the accuracy of point forecasts as well. Now this 
latter work is extended towards the calibration of the 2m temperature ensemble forecasts. 



2 Data 

As it was mentioned in the introduction, BMA post-processing of ensemble predictions was 
applied for temperature data produced by the operational ALADIN-HUNEPS system of the 
HMS. The data file contains 11 member ensemble (10 forecasts started from perturbed initial 
conditions and one control) of 42 hour forecasts for 2m temperature (given in Kelvin) for 
10 major cities in Hungary (Miskolc, Szombathely, Gyor, Budapest, Debrecen, Nyiregyhaza, 
Nagykanizsa, Pecs, Kecskemet, Szeged) together with the corresponding validating observa- 
tions, for the period between October 1, 2010 and March 25, 2011. The forecasts are initial- 
ized at 18 UTC. The data set is fairly complete, since there are only two days (18.10.2010 and 
15.02.2011) when three ensemble members are missing for all sites and one day (20.11.2010) 
when no forecasts are available. 



Figure [T] shows the verification rank histogram of the raw ensemble, that is the histogram 
of r anks of valid ating observations with respect to the corresponding ensemble forecasts (see 
e.g. lWilksl . l201ll . Section 8.7.2). It is clear that this histogram is far from the desired uniform 
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Figure 1: Verification rank histogram for 2m temperature of the 11-member ALADIN- 
HUNEPS ensemble. Period: October 1, 2010 - March 25, 2011. 



distribution, in many cases the ensemble members either underestimate, or overestimate the 
validating observations (the ensemble range contains the observed near-surface temperature 
values only in 46.36% of the cases). Hence, the ensemble is under-dispersive and in this way 
it is uncalibrated. In case of proper calibration the probability of a validating observation 
being below the ensemble range would be 1/12 and we would have the same probability 
for the observation being above it. In this way the probability 10/12 (i.e. 83.33%) can be 
considered as the nominal value of the ensemble range. 



3 The model and diagnostics 



In order to calibrate the ALADIN- HUNEPS ense r nble f orecasts for temperature the mod- 
ification of BMA norn a al mo del of iRafterv et ali (120051 ) for ensembles with exchangeable 



members (IFraley et all |2010| ) was use d. The group i ng of ensemble members was similar to 



the case of wind speed investigated in iBaran et all (120131 ). In the first model we have two 
exchangeable groups: one contains the control denoted by fc, the other one the 10 ensemble 
members corresponding to the different perturbed initial conditions which are denoted by 
fe^i, . . . , fe^io, respectively. We assume that the probability density function (PDF) of the 
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forecasted temperature x equals: 



P{x\fcjt 



fi 



lOi f^c.O) f^c,!; 



(3.1) 



1-u 
10 



10 



X^^(a;|/£j,&£,o,&£,i,c^^), 



where a; G [0,1], and g{x] f,bQ,bi,o''^) is a normal PDF with mean Bq + bif (linear 
bias correction) and variance cr^. In this way the two groups have different mean param- 



eters bcfl, b. 



c,i and bi Q, and a common standard deviation parameter a. Mean 

parameters 6c,o, &c,i and bg^o, be^i are estimated with linear regression, while the weight 
parameter u and the variance o"^, by maximum likelihood method, using training data 
consisting of ensemble members and verifying observations from the preceding n days 
(training period). The maximurn of the likelihood function is found with the EM algorithm 
flMcLachlan and Krishnanl . 119971 ) and due to the normality of the model both the expectation 
(E) and the maximi zation (M) s t ep lea d to closed formulae which allows fast calculations. 
For more details see lFraley et al\ fl2010l ). Once the estimated parameters for a given day are 
available, one can use either the mean or the median of the predictive PDF (13. ip as a point 
forecast. 

We also investigate special cases when in model (13. ip only additive bias correction is 
present, that is 6c,i = &£,i = 1? and when we do not use bias correction at all, i.e. 6c,o = 
bfo = and b.i = bpi = 1. 



Now, similarly to wind speed data investigated in lBaran et all ( 120131 ). to obtain the ten 
exchangeable ensemble members only five perturbations are calculated and then they are 
added to (odd numbered me mbers) and subtract ed from (even numbered members) the un- 
perturbed initial conditions ( iHoranyi et all 120111 ) . Figure |2] shows the plume diagram of en- 
semble forecasts of 2m temperature for Debrecen initialized at 18 UTC, 22.10.2010. This dia- 
gram clearly illustrates that the behaviour of ensemble member groups {fi^i, fi^3, fe^5, /^,7, fi^g} 
and {/^,2, fe,4, fi,^, fi,8, fi,io} differ from each other (particularly look at the 24-36h and 
48-60h forecast ranges). Therefore, in this way one can also consider a model with three 
exchangeable groups: control, odd numbered exchangeable members and even numbered 
exchangeable members. This idea leads to the following PDF of the forecasted wind speed 



x: 



li^lfc, fe,i, fe,io; bc,o,bc,i,bo,o,bo,i,be,o,be,i,a'^;Uc,u;o,u;e) = Ucg{x\fc,bcfl,bc,i,cr^) (3.2) 

5 

+ {^og{x\fe,2j-i,bo,o,bo,i,o-'^) +uJeg{x\fe,2j,be,o,be,i,(y'^ 



where for weights Uc,uJo,uJe G [0, 1] we have Uc + + 5ue = 1, while the definition of 
the PDF g and of the parameters bcfl, &c,i; &o,0; ^o,i; ^e,0; ^e,i and remains the same 
as for the model (13. ip . Obviously, both the weights and the parameters can be estimated in 
the same way as before. 
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Figure 2: Plume diagram of ensemble forecast of 2m temperature for Debrecen initialized at 
18 UTC, 22.10.2010 (solid line: control; dotted line: odd numbered members; dashed line: 
even numbered members). 

Similarly to the two-group case, we also investigate model (13. 2p with additive bias cor- 
rection (6c,i = &o,i = be,i = 1) and without bias correction at all (6c,o = ^a,o = ^e,o = and 
bc,i = bo,i = hf.1 = 1). We also remark, that according to our experiments with the current 
data set, the use of different variance parameters for the different groups in models (13.11) and 
(13. 2p does not lead to a significant improvement in the performance of the corresponding 
forecasts. 

As an illustration we consider the data and forecasts for Debrecen for two different dates 
24.11.2010 and 21.12.2010 illustrating two typical situations for models and ([S^D with 
linear bias correction. Figures and[3)D show the PDFs of the two groups in model (13. ip . 
the overall PDFs, the median forecasts, the verifying observations, the first and last deciles 
and the ensemble members. The same functions and quantities can be seen in Figures [3t 
and [3ll, where in addition to the overall PDF we have the three component PDFs and 
three groups of ensemble members. On 24.11.2010 the spread of the ensemble members is 
reasonable (the ensemble range equals 3.3 K) but all ensemble members underestimate the 
validating observation (279.9 K). Obviously the same holds for the ensemble median (276.9 
K), while BMA median forecasts corresponding to the two- and three-group models (279.8 
K and 280.0 K, respectively) are quite close to the true temperature. A different situation is 
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Figure 3: Ensemble BMA PDFs (overall: thick black line; control: red line; sum of ex- 
changeable members on (a) and (b): light blue line; on (c) and (d): green (odd members) 
and blue (even members) lines), ensemble members (circles with the same colours as the 
corresponding PDFs), ensemble BMA median forecasts (vertical black line), verifying ob- 
servations (vertical orange line) and the first and last deciles (vertical dashed lines) for 2m 
temperature in Debrecen for models (a) 24.11.2010, (b) 21.12.2010; and (E2D: (c) 

24.11.2010, (d) 21.12.2010; with linear bias correction. 



illustrated on Figures [3]d and|3jl, where the ensemble range is only 1.4 K, but it contains the 
validating observation (274.3 K). The ensemble median (273.5 K) slightly underestimates 
the true temperature, but the BMA post-processing yields more accurate estimators also in 
this case. The median forecasts corresponding to models (13. ip and (13. 2p are 273.8 K and 
273.7 K, respectively. 

In order to check the performance of probabilistic forecasts based on models (13. ip and 
(13. 2p and the corresponding point forecasts, as a reference we use the ensemble mean and 
the ensemble median. We compare the mean absolute error (MAE) and the root mean 
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square error (RMS E) of these point forec a sts and also t he me an continuous ranked probabil- 
ity score (CRPS) (iGneiting and Rafteryl . 120071 : IWilkd . I2OIII ) and the coverage and average 



width of 83.33 % central prediction intervals of the BMA predictive probability distribu- 
tions and of the raw ensemble. The coverage of this central prediction interval allows a 
direct comparison to the raw ensemble, where the ensemble of forecasts corresponding to a 
given location and time is considered as a statist ical sample and the sample quantiles are 
calculated according to iHyndman and Fan! (119961 . Definition 7). We remark that for MAE 
and RMSE the optimal point forec asts are the median and the mean, respectively fiGneitingl . 



2011 



F{y) 



Pinson and Hagedornl . l2012l ). Further, given a cumulative distribution function (CDF) 
and a real number x, the CRPS is defined as 



crps (F, a;) := / (F(y) - l{j^>^}) dy. 
J —00 



The mean CRPS of a probability forecast is the average of the CRPS values of the pre- 
dictive CDFs and corresponding validating observations taken over all locations and time 
points considered. For the raw ensemble the empirical CDF of the ensemble replaces the 
predictive CDF. The coverage of a (1 — a)100 %, a G (0, 1), central prediction interval is the 
proportion of validating observations located between the lower and upper a/2 quantiles 
of the predictive distribution. For a calibrated predictive PDF this value should be around 
(1 -a)100%. 



4 Results 



Modeling and analysis of temperature data was performed with the help of the ensembleBMA 
package in R (IFraley et all l2009l . I2OIII ) . As a first step the length of the appropriate training 
period was determined, then the performances of the BMA post-processed ensemble forecasts 
corresponding to models (13. ip and (13. 2p were analyzed. 



4.1 Training period 



Similarly to iRaftery et al\ (120051 ) and iBaran et al\ (120 131 ) to determine the length of the 
training period to be used we compare the MAE values of BMA median forecasts, the 
RMSE values of BMA mean forecasts, the CRPS values of BMA predictive distributions and 
the coverages and average widths of 83.33 % BMA central prediction intervals for training 
periods of length 10, 11, . . . , 60 calendar days. In order to ensure the comparability of the 
results we consider verification results from 02.12.2010 to 25.03.2011 (114 days; being after 
the maximum training period length). 

Consider first the two-group model (13. ip . In Figure H] the CRPS values of BMA predic- 
tive distributions, MAE values of BMA median forecasts and RMSE values of BMA mean 
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Figure 4: Mean CRPS of BMA predictive distributions, MAE of BMA median and RMSE 
of BMA mean forecasts corresponding to two-group model (13 .1!) for various training period 
lengths and different types of bias correction (black: linear; blue: additive; red: no bias 
correction). 



forecasts for different bias correction methods (black: linear; blue: additive; red: no bias 
correction) are plotted against the length of the training period. First of all it is noticeable 
that the results are very consistent for each diagnostics, i.e. the curves are similar for all 
measures. The best verification scores are obtained without any bias correction: the CRPS, 
MAE and RMSE values are rather constant (on low values) with respect to the length of the 
training period, their relative standard deviations are 0.21 %, 0.35 % and 0.13 %, respectively. 
In case of linear bias correction the values of all diagnostics are decreasing with the increase 
of the length of the training period reaching their minima around the 32-35 days interval 
(there is an increase afterwards). Particularly, CRPS, MAE and RMSE take their minima 
at day 33, the corresponding values are 1.60, 2.22 and 2.88, respectively. For additive bias 
correction the patterns of the curves are very similar to that of the linear case. The minima 
of CRPS (1.61) and RMSE (2.89) are also reached at 33 days, while MAE takes its minimum 
of 2.24 at 34 days (the value at day 33 is very near to this minimum, in fact that is the 
second smallest one). 

Furthermore, Figure [5] shows the average width and coverage of the 83.33 % central 
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Figure 5: Average width and coverage of 83.33 % BMA central prediction intervals corre- 
sponding to two-group model (13 .ip for various training period lengths and different types of 
bias correction (black: linear; blue: additive; red: no bias correction). Dashed line indicates 
the nominal coverage value. 



prediction interval for different bias correction methods as a function of the training period 
length. Concerning the average width the worst results (largest values) are obtained for 
model (13. ip without bias correction and, similarly to the previous diagnostics, the curve 
is rather flat. The linear bias correction results in the narrowest prediction intervals, the 
average width is increasing until around 50 days, then shows a slight decreasing trend, so 
up to 48 days, shorter training periods yield sharper estimates. A similar behaviour can be 
observed in the case of additive bias correction (the maximum is reached at around 40 days) . 

As far as the coverage values are concerned, unfortunately, none of the considered models 
result in coverage reaching the nominal value of 83.33% (dashed line). However, the case 
without bias correction is the best having its maximal value at day 40 with 81.14%. Until 
that day the coverage values are slightly increasing and then decreasing afterwards (although 
the curve is rather fiat again). The linear bias correction is starting from the lowest value 
and reaching its maxima in the period of 30-40 days (actually, the maximum is at day 40, 
but the values at these 10 days are very similar). The additive bias correction starts higher 
and reaches its plateau at around day 30 keeping its rather constant values until the end 
of the examined periods. The maximum coverage values of the two bias corrected cases 
are slightly below 80 %, the additive one being better with a small margin. Comparing the 
average width and coverage it can be noticed that they have opposite behaviour, i.e. the 
average width values favour shorter training periods, while the coverage figures prefer longer 
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Figure 6: Mean CRPS of BMA predictive distributions, MAE of BMA median and RMSE 
of BMA mean forecasts corresponding to three-group model (13.2!) for various training period 
lengths and different types of bias correction (black: linear; blue: additive; red: no bias 
correction). 



ones. A reasonable compromise should be found, which is at the range of 20-40 days. 

All in all for all three versions of model (13. ip a 33 days training period seems to be a 
reasonable choice (particularly see conclusions based on Figure HI which are not compromised 
by the other two diagnostics at Figure E]) . 

The same conclusion can be drawn from Figures [6] and [7] for the variants of the three-group 
model (13. 2p . The overall behaviour of the different systems for the various diagnostics is very 
similar to that for the two-group model. The no bias correction option provides the best 
results all over the time periods in 4 of the 5 diagnostics (the exception is the average width). 
The additive bias correction slightly outperforms the linear one (except in terms of average 
width). In terms of specific values the minima for CRPS, MAE and RMSE, respectively, are 
reached again at day 33 for the linear and additive bias corrections. Regarding the average 
width the values are increasing until day 30 and then they are rather constant until day 50 
and decreasing afterwards. For the coverage the period of highest values are between days 
30 and 40 (having another maximum of similar value for the additive bias correction at the 
end of the period). Hence, the training period proposed for the two-group model can be 
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Figure 7: Average width and coverage of 83.33 % BMA central prediction intervals corre- 
sponding to three-group model (13.21) for various training period lengths and different types of 
bias correction (black: linear; blue: additive; red: no bias correction). Dashed line indicates 
the nominal coverage value. 



kept for the three-group model as well, so we suggest the use of a training period of length 
33 days for all the investigated BMA models. 



4.2 Ensemble calibration using BMA post-processing 

According to the results of the previous subsection, to test the performance of BMA post- 
processing on the 11 member ALADIN-HUNEPS ensemble we use a training period of 33 
calendar days. In this way ensemble members, validating observations and BMA models 
are available for the period 04.11.2010 - 25.03.2011 (just after the 33 days training period 
having 141 calendar days, since on 20.11.2010 all ensemble members are missing). 

To get a first insight about the calibration of BMA post-processed forecasts we consider 
probability integral transform (PIT) histograms. The PIT is the value of the pred ictive cu- 
mulative distribution evaluated at the verifying observations flRaftery et a/.l . l2005l ). which is 
providing a good measure about the possible improvements of the under-dispersive character 
of the raw ensemble. The closer the histogram is to the uniform distribution, the better the 
calibration is. In Figure E] the PIT histograms corresponding to all three versions of two- 
and three-group BMA models (13.11) and (13. 2p are displayed. A comparison to the verification 
rank histogram of the raw ensemble (see Figure [1]) shows that post-processing significantly 
improves the statistical calibration of the forecasts. However, in case of linear bias correc- 
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Figure 8: PIT histograms for BMA post-processed forecasts using two-group (13. ip and three- 
group (13.21) models. 



tion both PIT histograms are slightly under-dispersive, while for models with additive bias 
correction and without bias correction one can accept uniformity. The latter statement can 
be also derived from Table [1] where the significance levels of Kolmogorov-Smirnov tests for 
uniformity of the PIT values are listed. Based on these tests it can be concluded that the 
additive bias correction produces the best PIT histograms (the no bias correction case is just 
slightly worse). 





Bias correction 




linear 


additive 


none 


BMA model f]3.ip 


3.7 X 10"^ 


0.25 


0.20 


BMA model fl3.2jl 


2.9 X 10'^ 


0.22 


0.06 



Table 1: Significance levels of Kolmogorov-Smirnov tests for uniformity of PIT values cor- 
responding to models (13. ip and (13. 2p . 
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Mean CRPS 


MAE 


RMSE 




Bias corr. 




median 


mean 


median 


mean 




linear 


1.54 


2.15 


2.15 


2.77 


2.77 


BMA model ([3^1) 


additive 


1.52 


2.11 


2.12 


2.75 


2.75 




none 


1.52 


2.10 


2.11 


2.75 


2.75 




linear 


1.54 


2.14 


2.14 


2.77 


2.78 


BMA model 1^ 


additive 


1.52 


2.10 


2.11 


2.75 


2.75 




none 


1.52 


2.09 


2.10 


2.74 


2.74 


Raw ensemble 




1.73 


2.10 


2.07 


2.76 


2.72 



Table 2: Mean CRPS of probabilistic, MAE and RMSE of median/mean forecasts. 



In Table [2] verification measures of probabilistic and point forecasts calculated using BMA 
models fl3.ip and fl3.2l) are compared to the corresponding scores for the raw ensemble. One 
can observe that in all cases there is a significant improvement in the value of the mean 
CRPS, but the accuracy of the post-processed point forecasts is equal to or even slightly 
worse than the accuracy of the corresponding quantities derived from the raw ensemble. It 
can be further mentioned that the best scores are obtained when no bias correction is applied 
and generally the three-group model is better than the two-group one. 

Table [3] gives the coverage and average width of the 83.33% central prediction interval 
calculated using models ( 13. ip and fl3.2p . and the corresponding measures calculated from 
the raw ensemble. As before, in the latter case the ensemble of forecasts corresponding to a 
given location and time is considered as a statistical sample. The BMA prediction intervals 
calculated from all models are nearly three times as wide as the corresponding intervals of the 
raw ensemble. This comes from the small dispersion of the raw ensemble, see the verification 
rank histogram of Figure [TJ It means that although the raw ensemble looks rather sharp in 
terms of average width, it cannot be correct, since the ensemble does not provide sufficient 
spread in describing the true variability of the possible atmospheric states. Concerning 
calibration one can observe that the coverage of all BMA central prediction intervals is 





BMA model 03. ip 


BMA model 03.21) 


Raw ensemble 


Bias correction 


linear additive none 


linear additive none 




Average Width 
Coverage (%) 


6.63 6.92 7.22 
77.7 79.9 81.6 


6.62 6.91 7.21 
77.9 80.1 81.8 


2.38 
37.4 



Table 3: Average width and coverage of 83.33 % central prediction intervals. 
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Figure 9: Probabilities of freezing in Debrecen in the winter period 01.12.2010 - 28.02.2011 
calculated using the three-group model (13.21) with different bias correction methods (black: 
linear; blue: additive; red: no bias correction) and the raw ensemble (green). Ticks indicate 
days with observed temperature below 273 K. 



much closer to the nominal coverage than the coverage of the central prediction interval 
calculated from the raw ensemble, which is rather poor. The diagnostics based on Table [3] 
also confirm that model (13. 2p distinguishing three exchangeable groups of ensemble forecasts 
slightly outperforms model (13. ip and the best results are obtained without bias correction. 
The additive bias correction is slightly better than the linear one. All these results are 
in good agreement with the ones obtained during the determination of the training period 
length. 

We have decided to consider also other aspects of the probability density function of 
the raw and calibrated ensembles. On Figure [9] probabilities of freezing (i.e. temperature 
forecast below zero) in the city of Debrecen in the winter period 01.12.2010 - 28.02.2011 



Linear bias correction Additive bias correction No bias correction 




Days Days Day; 



Figure 10: BMA weights of the control member of the two-group model (13. ip . 
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Control 


Exchangeable members 




fc 


fi,i 


fe,2 


fe,3 


fiA 


fi,5 


fe,6 


fl,7 


As 


fe,9 


A 10 


MAE 


1.80 


1.98 


1.97 


2.09 


2.02 


2.08 


1.95 


2.04 


1.85 


1.84 


1.96 


RMSE 


2.28 


2.50 


2.47 


2.56 


2.49 


2.59 


2.39 


2.56 


2.33 


2.40 


2.43 



Table 4: MAE and RMSE of the control and exchangeable ensemble forecasts for the period 
18.02.2011 - 08.03.2011. 



calculated using the three group model fl3.2p with different bias correction methods and the 
raw ensemble are plotted, while ticks indicate the days with observed temperature below 
zero in Celsius (below 273 K). This figure immediately reveals that contrary to the observa- 
tions the raw ensemble mostly does not indicate any freezing probability (no members have 
below- zero temperature forecast). This weakness is dramatically improved at the calibrated 
ensembles, where probabilities have been significantly increased, consequently the calibrated 
ensembles have an enhanced prognostic skill than it is the case for the raw ensemble. This 
improvement is indeed vital for the operational application of the ensemble forecasts. 

Figure [To] shows the BMA weights of the control member u in three variants of the two- 
group model (13. ip . In case of linear bias correction we have a real mixture in 63.19 % of the 
forecasted days (none of the groups has a weight which is almost 1, e.g. above 0.99), while for 
additive bias correction and for model (13. ip without bias correction this rate equals 79.17% 
and 81.25 %, respectively. The values of u that are close to 1 in all three cases correspond to 
a time interval 18.02.2011 - 08.03.2011, when the control member of the ensemble gives much 
better forecasts than the ten exchangeable ensemble members (this can be clearly seen from 
Table H] where the MAE and RMSE values of the particular ensemble members are given 
for the above mentioned period). This special behaviour is therefore correctly diagnosed by 
the calibration with the provision of large weights for the control member. Furthermore, 
for model (13. ip with linear bias correction there is a long period between 29.12.2010 and 
16.01.2011 where u < 0.01 (which means that the control is practically not used for the 
calibration), but in the other two cases this phenomenon practically disappears. The average 
MAE and RMSE values for that period are also displayed for each ensemble member (see 





Control 


Exchangeable members 




fc 


fe,i 


fl,2 


fi,3 


fiA 


fe,5 


fifi 


A,7 


f£,8 


fi,9 


A, 10 


MAE 


2.62 


2.44 


2.75 


2.55 


2.75 


2.52 


2.55 


2.63 


2.67 


2.81 


2.67 


RMSE 


3.17 


3.00 


3.31 


3.07 


3.29 


3.02 


3.09 


3.16 


3.28 


3.37 


3.29 



Table 5: MAE and RMSE of the control and exchangeable ensemble forecasts for the period 
29.12.2010 - 16.01.2011. 
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Linear bias correction Additive bias correction No bias correction 
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Figure 11: BMA weights of the control (upper panels) and of the odd (middle panels) and 
even (lower panels) ensemble members of the three-group model (13. 2p . 



Table E]). It can be seen that the control member is not really worse than the other ones (it 
outperforms around half of the members), which is correctly "recognized" by the additive 
and no bias correction techniques. 

Finally, Figure [11] shows the BMA weights of the control and of the odd and even ensemble 
members corresponding to the three variants of the three-group model (13. 2p . In case of linear 
and additive bias correction on a part of the problematic period 18.02.2011 - 08.03.2011 
(between 20.02.2011 and 05.03.2011) the weight Uc of the control is still close to 1 (greater 
than 0.98), while for the model without bias correction this happens only on four days. In 
the remaining cases (89.58 %, 90.28 % and 97.22 % of the days, respectively) we have real 
mixtures of normal distributions (it is particularly remarkable for the no bias correction). On 
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the other hand for the period, when the control gets zero weight in the hnear bias correction 
it is again (correctly) partly disappearing in the additive and no bias correction cases. 



5 Conclusions 

In the present study the BMA ensemble post-processing method is applied to the 11 mem- 
ber ALADIN-HUNEPS ensemble of the HMS to obtain 42 hour calibrated predictions for 
2m temperature. Two different BMA models are investigated, one assumes two groups of 
exchangeable members (control and forecasts from perturbed initial conditions), while the 
other considers three (control and forecasts from perturbed initial conditions with positive 
and negative perturbations). In both cases three different treatments of bias are considered 
(linear and additive bias correction and no bias correction at all) and for all models a 33 
days training period is suggested. The comparison of the raw ensemble and of the prob- 
abilistic forecasts shows that the mean CRPS values of BMA post-processed forecasts are 
considerably lower than the mean CRPS of the raw ensemble, while there is no big change 
in the MAE and RMSE values of BMA point forecasts (median and mean) compared to the 
MAE and RMSE of the ensemble median and of the ensemble mean. This latter fact might 
mean that although the spread of the raw ensemble was too small, the median and mean 
of the point forecasts were sufficiently correct not having too much space for further im- 
provement. It is remarkable to notice that the real probabilistic forecasts as demonstrated 
by the probability of freezing for Debrecen have been significantly improved. For models 
with additive bias correction and without bias correction the coverage of the 83.33 % central 
prediction interval is rather close to the nominal value, while their PIT values fit the uniform 
distribution. From the six competing post-processing methods the overall performance of 
the three- group model without bias correction seems to be the best, while models with linear 
bias correction give the worst results. 

Finally, we conclude that BMA post-processing of the ALADIN-HUNEPS temperature 
ensemble forecasts significantly improves the calibration and the probabilistic forecasts, how- 
ever no significant changes are found in accuracy of point forecasts. The first two aspects 
encourage the operational application of the method, while the third one calls for some 
further investigations to see whether improvements can be made for the point forecasts as 
well. 
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